How to use adapter modules?
While we could add adapter modules to every layer of the pretrained network, doing so is highly parameter-inefficient and redundant, especially for large networks with many layers. Instead, we coarsely categorize the network layers by their functional form into bottom, middle, and top sections, as visualized below.
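As a concrete illustration, here is a minimal sketch of this split, assuming a torchvision ResNet-18 backbone; the exact grouping of layers is an illustrative choice for the sketch, not a prescribed one.

```python
import torch
import torchvision

# A minimal sketch of the bottom/middle/top split, assuming a torchvision
# ResNet-18 backbone (torchvision >= 0.13 for the `weights` argument).
resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Bottom: the stem that consumes raw images directly.
bottom = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)

# Middle: the bulk (~90%) of the pretrained weights, kept frozen.
middle = torch.nn.Sequential(resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4)
for p in middle.parameters():
    p.requires_grad = False

# Top: converts the spatial feature map into an action (sketched further below).
images = torch.randn(1, 3, 224, 224)           # dummy image batch
spatial_features = middle(bottom(images))      # shape (1, 512, 7, 7)
print(spatial_features.shape)
```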
The bottom layers directly consume the raw images. When there is a mismatch between the downstream task's image observations and the feature statistics of the pretrained bottom layers, downstream task performance can be sub-optimal. Such mismatches are common for downstream manipulation tasks, since there is a significant domain gap between the data distribution of pretrained vision models (often in-the-wild imagery) and standard table-top settings with much closer, non-canonical camera views.
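One way to address such a mismatch is a small adapter that operates directly on the raw images before they reach the frozen stem. The sketch below shows a hypothetical bottom adapter: a tiny convolutional residual branch, zero-initialized so it starts out as the identity; the layer sizes are assumptions for illustration, not values from any specific implementation.

```python
import torch
import torch.nn as nn

class BottomAdapter(nn.Module):
    """Hypothetical bottom adapter: a small convolutional residual branch that
    adjusts raw image statistics before the frozen pretrained stem sees them."""

    def __init__(self, in_channels: int = 3, hidden: int = 16):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, in_channels, kernel_size=3, padding=1),
        )
        # Zero-init the last conv so the adapter starts as an identity mapping
        # and the pretrained behaviour is preserved at the start of training.
        nn.init.zeros_(self.branch[-1].weight)
        nn.init.zeros_(self.branch[-1].bias)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return images + self.branch(images)

adapted = BottomAdapter()(torch.randn(1, 3, 224, 224))  # same shape as the input
```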
The middle category, which contains most of the fixed pretrained network weights (~90%), extracts the appropriate abstraction of the input. However, these weights are trained on visual learning tasks that typically emphasize semantic understanding (e.g., image classification) rather than the spatial and causal understanding that matters for control.
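A common way to adapt these frozen middle blocks without updating their weights is to insert small bottleneck adapters between them. The sketch below shows one such hypothetical adapter (1x1 down-projection, nonlinearity, 1x1 up-projection, residual connection); the bottleneck width is an assumption chosen for the sketch.

```python
import torch
import torch.nn as nn

class MiddleAdapter(nn.Module):
    """Hypothetical bottleneck adapter placed after a frozen middle block:
    1x1 down-projection, nonlinearity, 1x1 up-projection, residual connection."""

    def __init__(self, channels: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.act = nn.ReLU()
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)
        # Zero-init the up-projection so training starts from the frozen features.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return features + self.up(self.act(self.down(features)))

# Only the adapter parameters are trained; the pretrained block stays frozen.
frozen_block_output = torch.randn(1, 256, 14, 14)
adapted = MiddleAdapter(channels=256)(frozen_block_output)  # same shape as input
```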
The top category takes the spatial representation from the middle category as input and outputs the robot action. This high-dimensional spatial representation (~20K values) is first reduced to a smaller one (~2K values), either via average/max pooling or by down-projecting with 1x1 convolutions or a small shared MLP. This smaller representation is then mapped directly to the action by a linear policy head.
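The sketch below illustrates one possible top section, assuming the 1x1-convolution down-projection path; the channel counts and action dimension are assumptions chosen so the flattened representation lands near the ~2K size mentioned above.

```python
import torch
import torch.nn as nn

class TopHead(nn.Module):
    """Hypothetical top section: down-project the spatial features with a 1x1
    convolution, flatten, and map to an action with a linear policy head."""

    def __init__(self, in_channels: int = 512, proj_channels: int = 40,
                 spatial: int = 7, action_dim: int = 7):
        super().__init__()
        self.project = nn.Conv2d(in_channels, proj_channels, kernel_size=1)
        self.policy = nn.Linear(proj_channels * spatial * spatial, action_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        z = self.project(features)                  # (B, 40, 7, 7) -> ~2K values
        return self.policy(z.flatten(start_dim=1))  # linear policy head -> action

# 512 x 7 x 7 (~25K) spatial values in, a 7-dim action out.
action = TopHead()(torch.randn(1, 512, 7, 7))
```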